Cleaning

At this stage, we use the file we had saved in the input_data folder to create datetime variables,

[1]:
use "..\input_data\NYPD_Motor_Vehicle_Collisions",clear

Notice that we used ..\ here again. When this jupyter notebook is opened, its current directory is in the “dofiles” folder. Therefore, we use the relative paths to “move” to the “input_data” to get the dataset.

[2]:
notes list

_dta:
  1.  "Downloaded 9 Dec 2018 11:30:26"

browse in your Stata, to have an idea of the data

[3]:
*browse

Re-order variables

Personally, I prefer ID type variables to be in the first few columns of the dataset. We will use Stata’s order command to move the variable uniquekey to the first column.

[4]:
order(uniquekey), before(date)

Create date time variables

  1. Preview the date variable (through list, codebook, browse(Data Editor))
  2. Determine the format of the date variable
    • Is it a string, int, float, etc
  3. Determine if it is in other types of date format (e.g. Excel’s date format)
[5]:
list date in 1/4

     +------------+
     |       date |
     |------------|
  1. | 11/16/2018 |
  2. | 11/16/2018 |
  3. | 11/16/2018 |
  4. | 11/16/2018 |
     +------------+
[6]:
codebook date

---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
date                                                                                                                                                                                                                                                       DATE
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

                  type:  string (str105)

         unique values:  2,331                    missing "":  0/1,386,003

              examples:  "03/24/2015"
                         "06/08/2017"
                         "08/15/2015"
                         "10/20/2013"

               warning:  variable has leading and embedded blanks

Now, we determine that the date variable is in a string format. We will use Stata’s HRF-to-SIF conversion functions in the datetime command.

[7]:
gen date_SIF = date(date, "MDY"),after(date)
(1 missing value generated)

We know that the date is a string variable, and it is in HRF( Human readable form) in the form of Month/Day/Year. We tell Stata’s date() function that through the mask "MDY". Then, I tell Stata to put date_SIF in the column after date.

[8]:
list date_SIF in 1/4

     +----------+
     | date_SIF |
     |----------|
  1. |    21504 |
  2. |    21504 |
  3. |    21504 |
  4. |    21504 |
     +----------+

The variable is in SIF(Stata internal form) now. So Stata understands it is a date. But in order for us to understand, we tell Stata to display this number in HRF through the format command. The date format is %td and the time format is %tc.

[9]:
format date_SIF %td

We repeat this for time variable.

[10]:
gen time_SIF = clock(time, "hm"),after(time)
format time_SIF %tc

(1 missing value generated)

Why do we want Stata to understand dates and times?

Because it allows us to do so much more! - view summary statistics of the date (e.g. first date, last date) - create year and month variables - create dummies based on durations - create dummies based on boolean logic

[11]:
sum date_SIF, format

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
    date_SIF |  1,386,002   14oct2015    669.6435  01jul2012  16nov2018
[12]:
gen year_SIF = year(date_SIF) // Extract year from the date
(1 missing value generated)
[13]:
gen month_SIF = month(date_SIF) // Extract month from the date
(1 missing value generated)
[14]:
gen special_period = (date_SIF >= td(14feb2014) & date_SIF <= td(14march2014))
tab special_period



special_per |
        iod |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |  1,370,768       98.90       98.90
          1 |     15,235        1.10      100.00
------------+-----------------------------------
      Total |  1,386,003      100.00

Renaming long variable names

This is entirely up to you. I prefer a shorter variable name, and use labels to store the longer and more descriptive name. Labeling is the more important thing to do when you begin this process of building an analytical dataset.

[15]:
rename numberofcyclistinjured cyclistinjured
rename numberofcyclistkilled cyclistkilled
[16]:
label variable cyclistinjured "Number of cyclists injured"
label variable cyclistkilled "Number of cyclists killed"

Creating dummies from long strings

I will show you several methods to finding keywords from string variables. For example, we want to create a variable to indicate if a collision occurred with a bicycle. - strpos - contains exact “string”

  1. Explore the string variable that we want to extract information from.
[17]:
tab vehicletypecode1

                VEHICLE TYPE CODE 1 |      Freq.     Percent        Cum.
------------------------------------+-----------------------------------
                              (ceme |          1        0.00        0.00
                              15 Pa |          1        0.00        0.00
                              2 TON |          1        0.00        0.00
                         2 dr sedan |          6        0.00        0.00
                             3-Door |         30        0.00        0.00
                              315 e |          1        0.00        0.00
                                 3D |         32        0.00        0.01
                         4 dr sedan |        202        0.01        0.02
                               4DSD |          3        0.00        0.02
                              4whee |          1        0.00        0.02
                              99999 |          1        0.00        0.02
                              ABULA |          1        0.00        0.02
                              ACCES |          2        0.00        0.02
                                 AM |        292        0.02        0.04
                              AMABU |          1        0.00        0.04
                                AMB |          7        0.00        0.04
                               AMBU |          5        0.00        0.04
                              AMBUL |        111        0.01        0.05
                          AMBULANCE |      2,406        0.17        0.23
                              AMbul |          1        0.00        0.23
                              APORT |          1        0.00        0.23
                              APPOR |          1        0.00        0.23
                                 AR |         41        0.00        0.23
                              ARMOR |          1        0.00        0.23
                               ARMY |          1        0.00        0.23
                              Ambul |         20        0.00        0.23
                          Ambulance |        265        0.02        0.25
                      Armored Truck |         42        0.00        0.25
                                 BA |          6        0.00        0.25
                               BACK |          2        0.00        0.25
                              BACKH |          4        0.00        0.25
                              BED T |          1        0.00        0.25
                              BICYC |          1        0.00        0.25
                            BICYCLE |      5,567        0.40        0.66
                               BOAT |          1        0.00        0.66
                              BOBCA |          3        0.00        0.66
                                BOX |          1        0.00        0.66
                              BOX T |          5        0.00        0.66
                                 BR |         42        0.00        0.66
                                 BS |          2        0.00        0.66
                                 BU |      2,229        0.16        0.82
                              BULLD |          2        0.00        0.82
                                BUS |     14,057        1.02        1.85
                                BUs |          1        0.00        1.85
                              Backh |          1        0.00        1.85
                     Beverage Truck |         31        0.00        1.85
                               Bike |      1,912        0.14        1.99
                               Boom |          1        0.00        1.99
                                Box |          1        0.00        1.99
                          Box Truck |      2,676        0.19        2.18
                              Box t |          1        0.00        2.18
                              Bucke |          1        0.00        2.18
                   Bulk Agriculture |          4        0.00        2.18
                                Bus |      1,992        0.14        2.33
                              CABIN |          1        0.00        2.33
                               CAMP |          1        0.00        2.33
                              CARGO |          2        0.00        2.33
                               CART |          1        0.00        2.33
                                 CB |         89        0.01        2.34
                              CHERR |          1        0.00        2.34
                              CHEVY |          1        0.00        2.34
                                 CM |         41        0.00        2.34
                               CMIX |          1        0.00        2.34
                                 CO |          2        0.00        2.34
                                COM |          8        0.00        2.34
                              COM T |          1        0.00        2.34
                               COM. |          1        0.00        2.34
                               COMB |          1        0.00        2.34
                              COMME |         11        0.00        2.34
                              COMMM |          1        0.00        2.34
                              CONCR |          1        0.00        2.34
                              CONST |          1        0.00        2.34
                               CONV |        398        0.03        2.37
                              COUPE |          1        0.00        2.37
                              CRANE |          1        0.00        2.37
                              CUSHM |          1        0.00        2.37
                              Cargo |          1        0.00        2.37
                          Carry All |        178        0.01        2.38
                              Cat 9 |          1        0.00        2.38
                        Chassis Cab |         83        0.01        2.39
                               Cmix |          1        0.00        2.39
                              Comix |          1        0.00        2.39
                               Comm |          1        0.00        2.39
                              Comme |          1        0.00        2.39
                     Concrete Mixer |         67        0.00        2.39
                        Convertible |        428        0.03        2.42
                              DELIV |         14        0.00        2.43
                               DELV |          2        0.00        2.43
                              DEMA- |          1        0.00        2.43
                               DIRT |          1        0.00        2.43
                              DOT R |          1        0.00        2.43
                                 DP |        367        0.03        2.45
                                 DS |      1,006        0.07        2.53
                               DUMP |         17        0.00        2.53
                              DUMPT |          1        0.00        2.53
                              DUNBA |          1        0.00        2.53
                              Deliv |          1        0.00        2.53
                               Dump |        414        0.03        2.56
                              Dumps |          1        0.00        2.56
                              E BIK |          1        0.00        2.56
                              E ONE |          1        0.00        2.56
                              E SCO |          1        0.00        2.56
                              E-BIK |          4        0.00        2.56
                              E-Bik |          2        0.00        2.56
                              E-MOT |          1        0.00        2.56
                              E/BIK |          1        0.00        2.56
                              EBIKE |          2        0.00        2.56
                              ELECT |         15        0.00        2.56
                              EMRGN |          1        0.00        2.56
                              EMS A |          1        0.00        2.56
                              EMS H |          1        0.00        2.56
                                 EN |          2        0.00        2.56
                                EPO |          1        0.00        2.56
                              Elect |          3        0.00        2.56
Enclosed Body - Removable Enclosure |          2        0.00        2.56
                                 FB |        238        0.02        2.58
                              FD NY |          1        0.00        2.58
                              FD tr |          1        0.00        2.58
                               FDNY |         34        0.00        2.58
                              FED E |          2        0.00        2.58
                               FIRE |         25        0.00        2.58
                         FIRE TRUCK |        850        0.06        2.64
                              FIRET |         10        0.00        2.64
                               FLAT |          9        0.00        2.65
                              FLATB |          1        0.00        2.65
                               FORD |          4        0.00        2.65
                               FORK |          2        0.00        2.65
                              FORK- |          1        0.00        2.65
                              FORKL |         11        0.00        2.65
                                 FR |         34        0.00        2.65
                               FREE |          1        0.00        2.65
                              FREIG |          6        0.00        2.65
                              FRONT |          1        0.00        2.65
                              Fd fi |          1        0.00        2.65
                               Fdny |          2        0.00        2.65
                               Fire |         15        0.00        2.65
                              Firet |          1        0.00        2.65
                           Flat Bed |        226        0.02        2.67
                          Flat Rack |         51        0.00        2.67
                               Ford |          1        0.00        2.67
                              Forkl |          2        0.00        2.67
                              Freig |          1        0.00        2.67
                              GARBA |          4        0.00        2.67
                              GE/SC |          1        0.00        2.67
                                 GG |        205        0.01        2.69
                               GOLF |          1        0.00        2.69
                              GOVER |          1        0.00        2.69
                                 GR |          2        0.00        2.69
                               GRAY |          1        0.00        2.69
                              Garba |          2        0.00        2.69
                  Garbage or Refuse |        223        0.02        2.70
                         Glass Rack |          1        0.00        2.70
                               Golf |          2        0.00        2.70
                               H/WH |          1        0.00        2.70
                              HEAVY |          1        0.00        2.70
                               HELP |          1        0.00        2.70
                                 HO |          1        0.00        2.70
                              HORSE |          2        0.00        2.70
                              HWY C |          1        0.00        2.70
                               Hand |          1        0.00        2.70
                             Hopper |          3        0.00        2.70
                              Humme |          1        0.00        2.70
                                 IP |          1        0.00        2.70
                              Ice C |          1        0.00        2.71
                              Inter |          1        0.00        2.71
                               John |          1        0.00        2.71
     LARGE COM VEH(6 OR MORE TIRES) |     14,527        1.06        3.76
                                 LF |         16        0.00        3.76
                              LIBER |          1        0.00        3.76
                               LIMO |          3        0.00        3.76
                              LIVER |          2        0.00        3.76
                     LIVERY VEHICLE |     10,481        0.76        4.53
                                 LL |        169        0.01        4.54
                              LOADE |          1        0.00        4.54
                                 LP |          1        0.00        4.54
                                 LW |          2        0.00        4.54
                               Lift |          1        0.00        4.54
                          Lift Boom |         19        0.00        4.54
                              Light |          1        0.00        4.54
                              Limou |          1        0.00        4.54
                     Livestock Rack |          2        0.00        4.54
                                Log |          1        0.00        4.54
                        Lunch Wagon |          4        0.00        4.54
                               MACK |          2        0.00        4.54
                               MAIL |          4        0.00        4.54
                               MARK |          1        0.00        4.54
                                 MB |         20        0.00        4.54
                                 MD |         32        0.00        4.54
                                 MH |          3        0.00        4.54
                              MILLI |          1        0.00        4.54
                               MINI |          3        0.00        4.54
                                 MK |          1        0.00        4.54
                              MOPED |         10        0.00        4.55
                              MOTOR |          6        0.00        4.55
                         MOTORCYCLE |      6,536        0.48        5.02
                                 MS |         42        0.00        5.02
                              MTA B |          4        0.00        5.02
                              Marke |          1        0.00        5.02
                               Mini |          1        0.00        5.02
                           Minibike |          7        0.00        5.03
                          Minicycle |          4        0.00        5.03
                              Mo pa |          1        0.00        5.03
                              Moped |         56        0.00        5.03
                              Motor |          2        0.00        5.03
                          Motorbike |         41        0.00        5.03
                         Motorcycle |        877        0.06        5.10
                     Motorized Home |          4        0.00        5.10
                       Motorscooter |         62        0.00        5.10
              Multi-Wheeled Vehicle |         15        0.00        5.10
                                N/A |          1        0.00        5.10
                              NEW Y |          2        0.00        5.10
                              NS AM |          1        0.00        5.10
                              NYC B |          1        0.00        5.10
                              NYC M |          1        0.00        5.10
                              NYC a |          1        0.00        5.10
                               NYPD |          2        0.00        5.10
                              OIL T |          2        0.00        5.10
                                OML |          1        0.00        5.10
                              OMNIB |          1        0.00        5.10
                                OMR |          5        0.00        5.10
                                OMT |          2        0.00        5.10
                              OTHER |     23,972        1.74        6.85
                          Open Body |          5        0.00        6.85
                               P/SE |          1        0.00        6.85
                               P/SH |          2        0.00        6.85
                                PAS |          2        0.00        6.85
                  PASSENGER VEHICLE |    715,227       52.00       58.85
                              PEDIC |          1        0.00       58.85
                            PEDICAB |         35        0.00       58.85
                              PICK- |          1        0.00       58.85
                      PICK-UP TRUCK |     23,069        1.68       60.53
                              PICKU |          4        0.00       60.53
                                 PK |        405        0.03       60.56
                                 PL |          6        0.00       60.56
                                 PM |          7        0.00       60.56
                              POSTA |          4        0.00       60.56
                              POSTO |          1        0.00       60.56
                              POWER |          7        0.00       60.56
                                PSD |          4        0.00       60.56
                                PSR |          1        0.00       60.56
                               PUMP |          1        0.00       60.56
                             Pallet |          7        0.00       60.56
                            Pedicab |         11        0.00       60.56
                               Pick |          1        0.00       60.56
                              Pick- |          1        0.00       60.56
                      Pick-up Truck |      3,883        0.28       60.85
                              Picku |          1        0.00       60.85
         Pickup with mounted Camper |          5        0.00       60.85
                              Porta |          1        0.00       60.85
                                R/V |          1        0.00       60.85
                              R/V C |          1        0.00       60.85
                              REFRI |          3        0.00       60.85
                              RENTA |          1        0.00       60.85
                                REP |          2        0.00       60.85
                              RESCU |          1        0.00       60.85
                                 RF |         56        0.00       60.85
                                 RV |         11        0.00       60.85
                              RYDER |          1        0.00       60.85
                   Refrigerated Van |         58        0.00       60.86
                              SANIT |          6        0.00       60.86
                              SCHOO |         16        0.00       60.86
                              SCOOT |         10        0.00       60.86
                            SCOOTER |        256        0.02       60.88
                                 SE |          1        0.00       60.88
                               SELF |          4        0.00       60.88
                               SEMI |          2        0.00       60.88
                              SEMI- |          1        0.00       60.88
                              SKATE |          3        0.00       60.88
            SMALL COM VEH(4 TIRES)  |     14,559        1.06       61.94
                              SMART |          1        0.00       61.94
                               SNOW |          1        0.00       61.94
                                 SP |         20        0.00       61.94
                                SPC |          2        0.00       61.94
      SPORT UTILITY / STATION WAGON |    313,498       22.79       84.73
                              SPRIN |          1        0.00       84.73
                                 ST |         15        0.00       84.74
                              ST150 |          1        0.00       84.74
                               STAK |          5        0.00       84.74
                              STREE |          6        0.00       84.74
                              SUBN/ |          1        0.00       84.74
                              SUBUR |          1        0.00       84.74
                                SUV |          3        0.00       84.74
                              SWEEP |          1        0.00       84.74
                              Sanit |          1        0.00       84.74
                              Schoo |          1        0.00       84.74
                         School Bus |          9        0.00       84.74
                              Scoot |          1        0.00       84.74
                              Sedan |     56,607        4.12       88.85
                               Semi |          1        0.00       88.85
                              Semi- |          1        0.00       88.85
                              Skate |          1        0.00       88.85
                              Smart |          1        0.00       88.85
                          Snow Plow |          1        0.00       88.85
                                Spc |          1        0.00       88.85
                              Sprin |          2        0.00       88.85
                      Stake or Rack |         11        0.00       88.85
Station Wagon/Sport Utility Vehicle |     43,706        3.18       92.03
                               Subn |          1        0.00       92.03
                              Sweep |          1        0.00       92.03
                               TANK |          2        0.00       92.03
                              TANKE |          2        0.00       92.03
                               TAXI |     50,669        3.68       95.72
                                 TK |      2,485        0.18       95.90
                                 TN |         77        0.01       95.90
                                TOW |          3        0.00       95.90
                              TOW T |         13        0.00       95.90
                              TOWTR |          2        0.00       95.90
                                 TR |        160        0.01       95.92
                               TRAC |          3        0.00       95.92
                              TRACK |          1        0.00       95.92
                              TRACT |         21        0.00       95.92
                              TRAFF |          1        0.00       95.92
                              TRAIL |         52        0.00       95.92
                              TRANS |          3        0.00       95.92
                                TRK |          2        0.00       95.92
                                TRL |          3        0.00       95.92
                              TRLPM |          1        0.00       95.92
                               TRLR |          2        0.00       95.92
                              TRUCK |         45        0.00       95.93
                                 TT |        130        0.01       95.94
                             Tanker |         76        0.01       95.94
                               Taxi |      5,894        0.43       96.37
                                Tow |          1        0.00       96.37
                              Tow T |          1        0.00       96.37
                          Tow Truck |          4        0.00       96.37
                Tow Truck / Wrecker |        125        0.01       96.38
                              Tow t |          1        0.00       96.38
                              Tow-t |          1        0.00       96.38
                              Tract |          3        0.00       96.38
               Tractor Truck Diesel |      1,175        0.09       96.46
             Tractor Truck Gasoline |        159        0.01       96.48
                              Trail |          5        0.00       96.48
                              Truck |          2        0.00       96.48
                               U.S. |          1        0.00       96.48
                              UHAUL |          4        0.00       96.48
                              UHUAL |          1        0.00       96.48
                                UKN |          1        0.00       96.48
                              UNKNO |          1        0.00       96.48
                            UNKNOWN |     19,929        1.45       97.93
                              UPS T |          1        0.00       97.93
                              UPS t |          1        0.00       97.93
                              US PO |          1        0.00       97.93
                               USPS |         16        0.00       97.93
                                 UT |          1        0.00       97.93
                               UTIL |          4        0.00       97.93
                              UTILI |          6        0.00       97.93
                                UTV |          1        0.00       97.93
                              Unkno |          1        0.00       97.93
                               Util |          1        0.00       97.93
                              Utili |          2        0.00       97.93
                                VAN |     26,540        1.93       99.86
                              VAN T |          1        0.00       99.86
                              VAN/T |          2        0.00       99.86
                               VANG |          1        0.00       99.86
                                 VC |          2        0.00       99.86
                                 VN |        712        0.05       99.91
                                 VT |          2        0.00       99.91
                                Van |        906        0.07       99.98
                         Van Camper |          4        0.00       99.98
                            Vanette |          1        0.00       99.98
                              Veriz |          1        0.00       99.98
                              WAGON |          1        0.00       99.98
                                 WD |          1        0.00       99.98
                               WHBL |          1        0.00       99.98
                              WHEEL |          1        0.00       99.98
                              WHITE |          1        0.00       99.98
                               WORK |          2        0.00       99.98
                              Wagon |          1        0.00       99.98
                       Well Driller |          1        0.00       99.98
                              Wheel |          1        0.00       99.98
                               Work |          1        0.00       99.98
                              YELLO |          2        0.00       99.98
                               ambu |          5        0.00       99.98
                              ambul |         21        0.00       99.98
                              armor |          1        0.00       99.98
                              box t |          7        0.00       99.98
                              bulld |          1        0.00       99.98
                                bus |          7        0.00       99.98
                              cargo |          1        0.00       99.98
                               cate |          1        0.00       99.98
                              cemen |          1        0.00       99.98
                              chevo |          1        0.00       99.98
                                com |          2        0.00       99.98
                              comme |          3        0.00       99.98
                              deliv |          3        0.00       99.98
                               delv |          3        0.00       99.98
                                 dp |          1        0.00       99.98
                               dsny |          1        0.00       99.98
                               dump |          5        0.00       99.98
                              e BIK |          1        0.00       99.98
                              e amb |          1        0.00       99.98
                              e com |          1        0.00       99.98
                              e sco |          1        0.00       99.98
                              e-bik |          2        0.00       99.98
                               east |          1        0.00       99.98
                              elect |          3        0.00       99.98
                               f550 |          1        0.00       99.98
                              fd tr |          1        0.00       99.98
                               fdny |          8        0.00       99.98
                               fire |         13        0.00       99.99
                              firet |          1        0.00       99.99
                               flat |          1        0.00       99.99
                              flatb |          1        0.00       99.99
                               ford |          2        0.00       99.99
                               fork |          1        0.00       99.99
                              forkl |          4        0.00       99.99
                              garba |          3        0.00       99.99
                              gator |          1        0.00       99.99
                              ice c |          1        0.00       99.99
                              icecr |          1        0.00       99.99
                              ladde |          1        0.00       99.99
                              light |          1        0.00       99.99
                               mail |          1        0.00       99.99
                                mcy |          1        0.00       99.99
                              mopad |          1        0.00       99.99
                              motor |          3        0.00       99.99
                                mta |          1        0.00       99.99
                              mta b |          2        0.00       99.99
                              nyc a |          1        0.00       99.99
                               omni |          1        0.00       99.99
                               p/sh |          1        0.00       99.99
                                pas |          1        0.00       99.99
                              passa |          1        0.00       99.99
                               pick |          4        0.00       99.99
                              posta |          2        0.00       99.99
                              power |          1        0.00       99.99
                               rd/s |          1        0.00       99.99
                               refg |          1        0.00       99.99
                                 rv |          2        0.00       99.99
                              sanit |          2        0.00       99.99
                              schoo |          7        0.00       99.99
                              sciss |          1        0.00       99.99
                              scoot |          2        0.00       99.99
                              seagr |          1        0.00       99.99
                               self |          1        0.00       99.99
                               semi |          2        0.00       99.99
                               sgws |          1        0.00       99.99
                              spc p |          1        0.00       99.99
                               spec |          1        0.00       99.99
                              stree |          1        0.00       99.99
                               subn |          3        0.00       99.99
                                 tk |          1        0.00       99.99
                                tow |          2        0.00       99.99
                              tow t |          3        0.00       99.99
                               trac |          1        0.00       99.99
                              tract |          6        0.00       99.99
                              trail |         17        0.00       99.99
                                trk |          1        0.00       99.99
                               trlr |          1        0.00       99.99
                              truck |          9        0.00       99.99
                              uhaul |          1        0.00       99.99
                              ulili |          1        0.00       99.99
                                unk |          3        0.00       99.99
                              unkno |          4        0.00       99.99
                              ups t |          1        0.00       99.99
                               usps |          2        0.00       99.99
                              utili |          6        0.00       99.99
                                van |         75        0.01      100.00
                              van t |          1        0.00      100.00
                                vol |          1        0.00      100.00
------------------------------------+-----------------------------------
                              Total |  1,375,333      100.00
  1. Find some key words. E.g. BICYC, BICYCLE, Bike
[18]:
gen bicyclerelated = vehicletypecode1 == "BICYC" | ///
vehicletypecode1 == "BICYCLE" | ///
vehicletypecode1 == "Bike"
[19]:
tab bicyclerelated

bicyclerela |
        ted |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |  1,378,523       99.46       99.46
          1 |      7,480        0.54      100.00
------------+-----------------------------------
      Total |  1,386,003      100.00

The above method requires that vehicletypecode1 have exactly those characters within the double quotation marks. If “BICYC” were spelt in lower capitals, the command above will not capture it. Hence, we can “improve” upon the code by using lower() or upper() functions to ensure that the strings are in the right form.

[20]:
gen bicyclerelated_2 = upper(vehicletypecode1) == "BICYC" | ///
upper(vehicletypecode1) == "BICYCLE" | ///
upper(vehicletypecode1) == "BIKE"
[21]:
tab bicyclerelated_2

bicyclerela |
      ted_2 |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |  1,378,523       99.46       99.46
          1 |      7,480        0.54      100.00
------------+-----------------------------------
      Total |  1,386,003      100.00

The method above is tedious because it requires that you look for every single keyword that is related to a bicycle. For example, in order to identify Minibike, you need to know that it exists in vehicletypecode1. We can use Stata’s strpos function to look within a string for a keyword. If strpos finds the keyword, the function will return the value of the position in the string. But instead of storing the position, we will use a boolean logic to identify that observation and create the dummy variable.

[22]:
gen bicyclerelated_3 = strpos(upper(vehicletypecode1),"BICYC" ) > 0 | ///
strpos(upper(vehicletypecode1),"BIKE" ) > 0
[23]:
tab bicyclerelated_3

bicyclerela |
      ted_3 |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |  1,378,473       99.46       99.46
          1 |      7,530        0.54      100.00
------------+-----------------------------------
      Total |  1,386,003      100.00

Next, we can also use the screening command. This command is a little more complicated but it allows for spelling mistakes to be captured as well.

[24]:
screening ,sources(vehicletypecode1, upper) keys(BICYCLE BIKE) letters(5 3) explore(count)

             Source |   Key |      Freq. | Percent
--------------------+-------+------------+---------
   vehicletypecode1 | BICYC |       5568 |   73.84
                    |   BIK |       1973 |   26.16
--------------------+-------+------------+---------
                    | Total |       7541 |  100.00

[25]:
screening ,sources(vehicletypecode1, upper) keys(BICYCLE BIKE) letters(5 3) newcode(bicyclerelated_4)
[26]:
tab bicyclerelated_4

bicyclerela |
      ted_4 |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |      5,568       73.84       73.84
          2 |      1,973       26.16      100.00
------------+-----------------------------------
      Total |      7,541      100.00

Explore the coodinates

[27]:
codebook latitude

---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
latitude                                                                                                                                                                                                                                               LATITUDE
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

                  type:  numeric (float)

                 range:  [0,43.344444]                units:  1.000e-06
         unique values:  70,056                   missing .:  236,251/1,386,003

                  mean:   40.7115
              std. dev:   .706688

           percentiles:        10%       25%       50%       75%       90%
                            40.619   40.6687   40.7229   40.7674    40.837
[28]:
codebook longitude

---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
longitude                                                                                                                                                                                                                                             LONGITUDE
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

                  type:  numeric (float)

                 range:  [-201.35999,0]               units:  1.000e-06
         unique values:  48,842                   missing .:  236,251/1,386,003

                  mean:  -73.9102
              std. dev:   1.72118

           percentiles:        10%       25%       50%       75%       90%
                           -74.003  -73.9782  -73.9311  -73.8678  -73.8049

When we examine the codebook outputs. It is helpful to look at the unique values and the missing. From this we know that these coordinates are not all unique and that there are missing observations in the data.

We will try to fill in the missing latitude and logitude based on other characteristics about the intersection. But first we will create a dummy variable so that we can keep track of observations with missing longitude and latitudes.

[29]:
gen mi_latlong = mi(latitude) & mi(longitude)

Now we can see the number of missing over the years

[30]:
tab mi_latlong year_SIF, column

+-------------------+
| Key               |
|-------------------|
|     frequency     |
| column percentage |
+-------------------+

           |                                   year_SIF
mi_latlong |      2012       2013       2014       2015       2016       2017       2018 |     Total
-----------+-----------------------------------------------------------------------------+----------
         0 |    85,451    171,915    172,726    182,957    136,345    212,986    187,372 | 1,149,752
           |     84.99      84.39      83.84      84.04      59.71      92.84      93.58 |     82.95
-----------+-----------------------------------------------------------------------------+----------
         1 |    15,089     31,809     33,302     34,734     92,019     16,437     12,860 |   236,250
           |     15.01      15.61      16.16      15.96      40.29       7.16       6.42 |     17.05
-----------+-----------------------------------------------------------------------------+----------
     Total |   100,540    203,724    206,028    217,691    228,364    229,423    200,232 | 1,386,002
           |    100.00     100.00     100.00     100.00     100.00     100.00     100.00 |    100.00

The column option tells stata to provide percentages by each year.

There are several key variables that we can use to determine if the intersections are the same. - borough - zipcode - onstreetname - crossstreetname

We will use Stata’s variable[_n-1] code to compare between rows of the observations. The idea is that if the borough, zipcode, onstreetname and crossstreetname of the previous row of observation is the same as the borough of the current row of observation, then the intersection has to be the same. Therefore, if the latitude and longitude are missing, we can fill it in with the non-missing latitude and longitude from the previous row.

First, we want to sort the data in order.

[31]:
sort onstreetname crossstreetname latitude longitude
[32]:
replace latitude = latitude[_n-1] if ///
mi(latitude) & ///
(borough == borough[_n-1]) & ///
(zipcode == zipcode[_n-1]) & ///
(onstreetname == onstreetname[_n-1]) & ///
(crossstreetname ==  crossstreetname[_n-1])

replace longitude = longitude[_n-1] if ///
mi(longitude) & ///
(borough == borough[_n-1]) & ///
(zipcode == zipcode[_n-1]) & ///
(onstreetname == onstreetname[_n-1]) & ///
(crossstreetname ==  crossstreetname[_n-1])

(74,517 real changes made)

(74,517 real changes made)

We managed to “back fill” quite a large number of observations! Now, we want to identify observations that still have missing latitude and longitudes.

[33]:
gen mi_latlong2 = mi(latitude) & mi(longitude)
[34]:
tab mi_latlong2 year_SIF, column

+-------------------+
| Key               |
|-------------------|
|     frequency     |
| column percentage |
+-------------------+

mi_latlong |                                   year_SIF
         2 |      2012       2013       2014       2015       2016       2017       2018 |     Total
-----------+-----------------------------------------------------------------------------+----------
         0 |    86,719    174,277    175,092    185,272    188,838    220,422    193,649 | 1,224,269
           |     86.25      85.55      84.98      85.11      82.69      96.08      96.71 |     88.33
-----------+-----------------------------------------------------------------------------+----------
         1 |    13,821     29,447     30,936     32,419     39,526      9,001      6,583 |   161,733
           |     13.75      14.45      15.02      14.89      17.31       3.92       3.29 |     11.67
-----------+-----------------------------------------------------------------------------+----------
     Total |   100,540    203,724    206,028    217,691    228,364    229,423    200,232 | 1,386,002
           |    100.00     100.00     100.00     100.00     100.00     100.00     100.00 |    100.00
[35]:
tab mi_latlong year_SIF, column

+-------------------+
| Key               |
|-------------------|
|     frequency     |
| column percentage |
+-------------------+

           |                                   year_SIF
mi_latlong |      2012       2013       2014       2015       2016       2017       2018 |     Total
-----------+-----------------------------------------------------------------------------+----------
         0 |    85,451    171,915    172,726    182,957    136,345    212,986    187,372 | 1,149,752
           |     84.99      84.39      83.84      84.04      59.71      92.84      93.58 |     82.95
-----------+-----------------------------------------------------------------------------+----------
         1 |    15,089     31,809     33,302     34,734     92,019     16,437     12,860 |   236,250
           |     15.01      15.61      16.16      15.96      40.29       7.16       6.42 |     17.05
-----------+-----------------------------------------------------------------------------+----------
     Total |   100,540    203,724    206,028    217,691    228,364    229,423    200,232 | 1,386,002
           |    100.00     100.00     100.00     100.00     100.00     100.00     100.00 |    100.00

There are other ways to identify the latitude and longitude of these missing variables through a process called Geocoding. But we will explore that another time.

For now, we will drop observations that are still missing.

[36]:
drop if mi_latlong2 == 1
(161,734 observations deleted)

There are also some observations that have coordinates that are outside the boundaries of New York City. We will drop them as well.

[37]:
drop if longitude < -74.5 | longitude > -73
drop if latitude < 40 |  latitude > 41

(607 observations deleted)

(8 observations deleted)

Since, there are many observations, I will pull out a random sample of 2% of the data to quickly visualize. In Stata, we use preserve to keep the data at that point in memory and manipulate the data to do calculations/obtain weights/collapse/etc. Then, after we obtain our numbers/graphs/etc, we will use store to “get back” to the dataset that was at the point of preserve. We typically use preserve and restore commands when collapsing data.

[40]:
*preserve
sample 2
twoway scatter latitude longitude, msymbol(o) msize(tiny)
*restore

(1,199,181 observations deleted)



[38]:
display "$S_DATE $S_TIME"
notes: "Cleaned $S_DATE $S_TIME"
notes list

 7 May 2019 14:55:13



_dta:
  1.  "Downloaded 9 Dec 2018 11:30:26"
  2.  "Cleaned 7 May 2019 14:55:13"
[39]:
*save "..\working_data\NYPD_Motor_Vehicle_Collisions_clean.dta"